---
title: "Blackwell Electronics"
subtitle: "Customer Analytics Report — Purchasing Behaviour & Demographics"
author: "Freda Erinmwingbovo"
date: today
format:
html:
theme: [darkly, custom.scss]
toc: true
toc-depth: 3
toc-title: "Report Contents"
toc-location: left
number-sections: true
code-fold: true
code-tools: true
code-summary: "Show Code"
df-print: paged
page-layout: article
smooth-scroll: true
self-contained: true
execute:
warning: false
message: false
---
```{=html}
<div class="cover-header">
<div class="cover-top-bar">
<span class="cover-company">BLACKWELL ELECTRONICS</span>
<span class="cover-tag">CONFIDENTIAL INTERNAL REPORT</span>
</div>
<div class="cover-main">
<div class="cover-label">Customer Analytics</div>
<h1 class="cover-title">Purchasing Behaviour<br/>& Demographics</h1>
<p class="cover-subtitle">
An investigation into regional spending patterns, customer demographics,
and predictive modelling of purchase behaviour across 80,000 transactions.
</p>
</div>
<div class="cover-bottom-bar">
<div class="cover-meta">
<span class="meta-label">PREPARED BY</span>
<span class="meta-value">Freda Erinmwingbovo</span>
</div>
<div class="cover-meta">
<span class="meta-label">PREPARED FOR</span>
<span class="meta-value">Danielle Sherman, CTO</span>
</div>
<div class="cover-meta">
<span class="meta-label">DATE</span>
<span class="meta-value">February 2026</span>
</div>
<div class="cover-meta">
<span class="meta-label">RECORDS ANALYSED</span>
<span class="meta-value">80,000 Transactions</span>
</div>
</div>
</div>
<style>
.cover-header {
background: linear-gradient(135deg, #0a0a0f 0%, #12121a 50%, #0d0d1f 100%);
border: 1px solid #2a2a3a;
border-left: 4px solid #00d4ff;
border-radius: 8px;
margin: 0 0 3rem 0;
overflow: hidden;
position: relative;
}
.cover-header::before {
content: 'BWE';
position: absolute;
right: -10px;
top: 50%;
transform: translateY(-50%);
font-size: 12rem;
font-weight: 900;
color: rgba(0, 212, 255, 0.03);
font-family: 'Cormorant Garamond', serif;
line-height: 1;
pointer-events: none;
}
.cover-top-bar {
display: flex;
justify-content: space-between;
align-items: center;
padding: 14px 32px;
background: rgba(0, 212, 255, 0.05);
border-bottom: 1px solid #2a2a3a;
}
.cover-company {
font-family: 'JetBrains Mono', monospace;
font-size: 0.72rem;
letter-spacing: 4px;
color: #00d4ff;
font-weight: 500;
}
.cover-tag {
font-family: 'JetBrains Mono', monospace;
font-size: 0.65rem;
letter-spacing: 2px;
color: #8a8a9a;
}
.cover-main {
padding: 52px 32px 40px;
text-align: center;
}
.cover-label {
font-family: 'JetBrains Mono', monospace;
font-size: 0.7rem;
letter-spacing: 4px;
text-transform: uppercase;
color: #7b61ff;
margin-bottom: 16px;
text-align: center;
}
.cover-title {
font-family: 'Cormorant Garamond', serif !important;
font-size: clamp(2.2rem, 5vw, 3.4rem) !important;
font-weight: 700 !important;
line-height: 1.15 !important;
color: #ffffff !important;
margin-bottom: 24px !important;
border: none !important;
padding: 0 !important;
text-align: center;
}
.cover-subtitle {
font-size: 0.95rem;
color: #8a8a9a;
max-width: 580px;
line-height: 1.8;
margin: 0 auto;
}
.cover-bottom-bar {
display: flex;
gap: 0;
border-top: 1px solid #2a2a3a;
flex-wrap: wrap;
}
.cover-meta {
display: flex;
flex-direction: column;
gap: 5px;
padding: 20px 32px;
border-right: 1px solid #2a2a3a;
flex: 1;
min-width: 160px;
}
.meta-label {
font-family: 'JetBrains Mono', monospace;
font-size: 0.62rem;
letter-spacing: 2.5px;
color: #8a8a9a;
text-transform: uppercase;
}
.meta-value {
font-size: 0.88rem;
font-weight: 500;
color: #e8e8e8;
}
</style>
```
# Introduction and Data Loading
## Background & Objectives
Blackwell Electronics has operated as a successful electronics retailer for over three decades. Following the launch of its eCommerce platform just over a year ago, the business has begun accumulating valuable customer transaction data, both online and in-store.
This report presents a data-driven investigation into customer purchasing behaviour, commissioned by Danielle Sherman (CTO) and informed by the business questions raised by VP of Sales, Martin Goodrich. The analysis is structured around three core questions:
1. **Regional Spending**: Do customers in different regions spend more per transaction? Which regions spend the most and least?
2. **Demographics & Channel**: Are older customers more likely to shop in-store than online? What factors predict how a customer shops?
3. **Predictive Modelling**: Can we predict a customer's region, their purchase channel, or their age from the available data?
The dataset contains **80,000 customer transactions**, capturing age, number of items purchased, transaction amount, purchase channel (online vs. in-store), and region.
## Libraries Used
The analysis draws on R's `tidyverse` ecosystem for data manipulation and visualisation, and on the `caret` framework for machine learning, an industry-standard approach to building and evaluating predictive models.
```{r}
#| label: load-libraries
#| message: false
#| warning: false
# Data manipulation and visualisation
library(tidyverse) # Core suite: dplyr, ggplot2, readr, tidyr etc.
library(scales) # For formatting axes (currency, percentages)
library(knitr) # For clean tables in the report
library(kableExtra) # For styled, professional tables
# Machine learning
library(caret) # Classification and regression training framework
library(randomForest) # Random Forest algorithm
library(nnet) # Multinomial logistic regression (for region prediction)
# Report styling
library(ggthemes) # Additional ggplot2 themes
library(patchwork) # Combining plots into side-by-side panels
```
## Importing the Data
```{r}
#| label: load-data
# Import the customer transaction dataset
df <- read_csv("Demographic_Data.csv")
# Rename the purchase-channel column for clarity and consistency
# (the remaining columns already have suitable names)
df <- df %>%
  rename(channel = `in-store`)
# Convert categorical variables to factors with meaningful labels
df <- df %>%
mutate(
channel = factor(channel, levels = c(0, 1),
labels = c("Online", "In-Store")),
region = factor(region, levels = c(1, 2, 3, 4),
labels = c("North", "South", "East", "West"))
)
# Preview the data
glimpse(df)
```
# Data Preprocessing
Before any analysis can begin, it is essential to examine the quality and structure of the data. Raw data, even from internal systems, can contain errors, inconsistencies, missing values, or extreme observations that could distort results and lead to misleading conclusions.
This section documents every step taken to inspect, clean, and validate the dataset prior to analysis. All decisions made during preprocessing are explicitly justified so that the analytical process is fully transparent and reproducible.
## Data Inspection
The first step is simply to look at the data carefully: its dimensions, structure, data types, and a statistical snapshot of every variable. This gives us an immediate sense of what we are working with and flags anything that needs attention before we proceed.
```{r}
#| label: data-inspection
# --- Dimensions ---
cat("Rows:", nrow(df), "\n")
cat("Columns:", ncol(df), "\n")
# --- First 6 rows ---
df %>% head() %>% kable() %>% kable_styling(full_width = TRUE)
# --- Clean styled summary table ---
summary_table <- data.frame(
Variable = c("channel", "age", "items", "amount", "region"),
Type = c("Factor", "Numeric", "Numeric", "Numeric", "Factor"),
Min = c("—", "18", "1", "$5.01", "—"),
Mean = c("—", "45.76", "4.51", "$835.92", "—"),
Median = c("—", "45", "4", "$582.32", "—"),
Max = c("—", "85", "8", "$3,000", "—")
)
summary_table %>%
kable(align = c("l","l","r","r","r","r","l")) %>%
kable_styling(full_width = TRUE)
```
### Interpretation
The dataset contains 80,000 rows and 5 columns. Each row represents a single customer transaction. A statistical summary reveals the following initial observations:
- **Age** ranges from 18 to 85, with a mean of approximately 46 years.
- **Items** purchased per transaction ranges from 1 to 8, averaging around 4 to 5 items.
- **Amount** spent ranges from near zero to several thousand dollars, suggesting potential outliers at both ends that will need investigation.
- **Channel** and **Region** are correctly encoded as factors with their respective labels.
No immediately obvious structural issues are visible, but we will now systematically check for missing values, duplicates, and outliers.
## Missing Values
Missing data is one of the most common data quality issues in real-world datasets. If left unaddressed, missing values can cause errors in statistical functions and machine learning models, or silently produce biased results. We systematically check every column for the presence of NA values before proceeding.
```{r}
#| label: missing-values
# --- Count of missing values per column ---
missing_summary <- df %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything(),
names_to = "Variable",
values_to = "Missing_Count") %>%
mutate(
Total_Rows = nrow(df),
Missing_Percent = round((Missing_Count / Total_Rows) * 100, 2)
)
# --- Display as a clean table ---
missing_summary %>%
kable(
col.names = c("Variable", "Missing Count", "Total Rows", "Missing (%)"),
caption = "Table: Missing Value Summary"
) %>%
kable_styling(full_width = FALSE)
```
### Interpretation
The missing value check confirms that the dataset is complete. Every column registers zero missing values across all 80,000 records. This is an important quality indicator, as it means no imputation strategy is required and we can proceed to the next preprocessing step with confidence that no records will be silently dropped during analysis.
## Duplicate Records
Duplicate records occur when the same transaction is captured more than once in a dataset. This can happen due to system errors, data entry mistakes, or merging of multiple data sources. Duplicates artificially inflate sample sizes and can bias both descriptive statistics and machine learning models. We identify and remove any exact duplicate rows before proceeding.
```{r}
#| label: duplicates
# --- Identify and save duplicates before removal ---
dupes <- df[duplicated(df), ]
duplicate_count <- nrow(dupes)
# --- Remove duplicates ---
df <- df %>% distinct()
# --- Print counts together ---
cat("Duplicate rows detected:", duplicate_count, "\n",
"Rows after deduplication:", nrow(df), "\n",
"Columns:", ncol(df))
# --- Preview of removed duplicates as styled table ---
dupes %>%
head() %>%
kable(caption = "Table: Preview of Duplicate Records Removed") %>%
kable_styling(full_width = FALSE)
```
### Interpretation
The duplicate check identified 21 exact duplicate records within the dataset. These rows were identical across all five variables, suggesting they are most likely the result of a data entry or system logging error rather than genuine repeat transactions. It is statistically improbable that 21 different customers would independently purchase the exact same number of items for the exact same amount in the same region through the same channel.
All 21 duplicate rows were removed using `distinct()`, reducing the dataset from 80,000 to 79,979 records. This is a negligible reduction of just 0.026% and will have no meaningful impact on the analysis, but it ensures the integrity of every subsequent step.
## Data Types Validation
Every variable in a dataset must be stored as the correct data type. A variable stored in the wrong type can cause silent errors. For example, a categorical variable stored as a numeric will be treated as a continuous measurement by statistical functions and machine learning algorithms, producing nonsensical results. We verify that each variable is correctly typed before proceeding.
```{r}
#| label: data-types
# --- Check data types of all columns ---
data_types <- df %>%
summarise(across(everything(), class)) %>%
pivot_longer(everything(),
names_to = "Variable",
values_to = "Data_Type") %>%
  mutate(
    Expected_Type = c("Factor", "Numeric", "Numeric", "Numeric", "Factor"),
    Status = ifelse(Data_Type == tolower(Expected_Type),
                    "✔ Correct", "✘ Incorrect")
  )
# --- Display as clean table ---
data_types %>%
kable(
col.names = c("Variable", "Current Type", "Expected Type", "Status"),
caption = "Table: Data Types Validation"
) %>%
kable_styling(full_width = FALSE)
```
### Interpretation
All five variables are correctly typed:
- **Channel** and **Region** are stored as factors, meaning R treats them as discrete categories rather than continuous numbers. This matters: without it, a region coded 1 to 4 would be interpreted as a measurement in which West (4) is somehow "greater than" North (1), which is meaningless.
- **Age**, **Items**, and **Amount** are stored as numeric, which is appropriate as they represent measurable quantities that can be averaged, summed, and used in regression models.
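As a quick illustration of why the factor encoding matters, consider a toy vector of region codes (made-up values, not the actual dataset). Averaging the raw codes produces a number with no meaning, while the factor version refuses the operation and pushes us toward the correct categorical summary:

```r
# Toy region codes: as plain numbers vs. as a labelled factor
region_numeric <- c(1, 4, 4, 2)                 # raw codes
region_factor  <- factor(region_numeric, levels = 1:4,
                         labels = c("North", "South", "East", "West"))

mean(region_numeric)   # 2.75 -- a "mean region" that means nothing
mean(region_factor)    # NA, with a warning that the argument is not numeric
table(region_factor)   # the appropriate summary for a categorical variable
```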
No data type corrections are required. We proceed to outlier detection.
## Outlier Detection
Outliers are observations that fall unusually far from the rest of the data. They can arise from genuine extreme behaviour, such as a customer making a very large purchase, or from data entry errors. Either way, they require careful investigation because they can heavily distort statistical measures like the mean and can negatively impact the performance of machine learning models.
We use two complementary approaches to detect outliers:
1. **Visual inspection**: boxplots to see the distribution of each numeric variable and immediately spot extreme values
2. **The IQR Method**: a statistical rule that flags any observation falling below Q1 − 1.5×IQR or above Q3 + 1.5×IQR as a potential outlier. This is a widely used standard for outlier screening.
We examine the three numeric variables: **age**, **items**, and **amount**.
```{r}
#| label: outlier-detection
#| fig-width: 10
#| fig-height: 5
# --- Define a consistent colour palette ---
palette <- c(
"age" = "#00d4ff",
"items" = "#7b61ff",
"amount" = "#f0c040"
)
# --- Reshape data for faceted boxplots ---
df_long <- df %>%
select(age, items, amount) %>%
pivot_longer(everything(),
names_to = "Variable",
values_to = "Value")
# --- Boxplots ---
ggplot(df_long, aes(x = Variable, y = Value, fill = Variable)) +
geom_boxplot(outlier.colour = "#ff4d4d",
outlier.alpha = 0.3,
outlier.size = 0.8,
width = 0.5) +
facet_wrap(~ Variable, scales = "free", nrow = 1) +
scale_fill_manual(values = palette) +
labs(
title = "Figure: Boxplot Distribution of Numeric Variables",
subtitle = "Red dots indicate potential outliers flagged by the IQR method",
x = NULL,
y = "Value"
) +
theme_minimal(base_size = 13) +
theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
panel.background = element_rect(fill = "#12121a", color = NA),
panel.grid.major = element_line(color = "#2a2a3a"),
panel.grid.minor = element_blank(),
strip.background = element_rect(fill = "#1a1a2e", color = NA),
strip.text = element_text(color = "#00d4ff",
face = "bold",
size = 10),
plot.title = element_text(color = "#ffffff",
face = "bold",
size = 13),
plot.subtitle = element_text(color = "#8a8a9a", size = 10),
axis.text = element_text(color = "#8a8a9a"),
axis.title = element_text(color = "#8a8a9a"),
legend.position = "none"
)
# --- IQR Method: Count outliers per variable ---
count_outliers <- function(x) {
Q1 <- quantile(x, 0.25)
Q3 <- quantile(x, 0.75)
IQR <- Q3 - Q1
sum(x < (Q1 - 1.5 * IQR) | x > (Q3 + 1.5 * IQR))
}
outlier_summary <- df %>%
summarise(
Age = count_outliers(age),
Items = count_outliers(items),
Amount = count_outliers(amount)
) %>%
pivot_longer(everything(),
names_to = "Variable",
values_to = "Outlier_Count") %>%
mutate(
Total_Rows = nrow(df),
Outlier_Percent = round((Outlier_Count / Total_Rows) * 100, 2)
)
# --- Display outlier summary table ---
outlier_summary %>%
kable(
col.names = c("Variable", "Outlier Count",
"Total Rows", "Outliers (%)"),
caption = "Table: Outlier Count by Variable (IQR Method)"
) %>%
kable_styling(full_width = FALSE)
```
### Interpretation
The boxplots and IQR analysis reveal a clear and important finding:
- **Age**: Zero outliers detected. The age range of 18 to 85 years is entirely plausible for a retail electronics customer base. The distribution is clean and requires no intervention.
- **Items**: Zero outliers detected. With a natural ceiling of 8 items per transaction, this variable is tightly bounded and consistent throughout the dataset.
- **Amount**: **2,320 outliers detected, representing 2.9% of all transactions.** The boxplot clearly shows a right-skewed distribution, with a cluster of transactions at the higher end of the spending range pulling the mean (\$835.92) significantly above the median (\$582.32). These are transactions where customers spent considerably more than the typical customer.
The critical question is: are these outliers errors, or are they real? In an electronics retail context, high-value transactions are entirely plausible. A customer purchasing a high-end television, laptop, or multiple premium devices in a single transaction could legitimately spend at the upper end of the range. The minimum amount of \$5.01 and maximum of \$3,000.00 are both realistic for an electronics retailer.
These outliers are therefore likely **genuine extreme values** rather than data entry errors, and the decision of how to treat them will be addressed in the next step.
## Outlier Treatment
Having identified 2,320 outliers in the `amount` variable, we must now decide how to handle them. There are three common approaches:
1. **Remove them**: delete all flagged rows from the dataset. This is appropriate when outliers are confirmed errors. However, removing 2.9% of genuine transactions would mean discarding real customer behaviour and potentially biasing our analysis against high-value customers.
2. **Cap them (Winsorisation)**: replace values beyond the IQR boundaries with the boundary value itself. This retains all records but reduces the influence of extreme values.
3. **Retain them**: keep the outliers as-is, on the basis that they represent genuine behaviour. This is appropriate when the data generating process naturally produces extreme values.
Given the business context (an electronics retailer where high-value purchases are entirely expected), a **retain and document** strategy is adopted. The outliers in `amount` reflect real purchasing behaviour from high-spending customers, which is precisely the kind of insight Blackwell's marketing team needs to understand. Removing or capping these values would obscure an important segment of their customer base.
However, to ensure the statistical models are not unduly distorted, a **log transformation** is applied to `amount` where appropriate during the modelling phase. For now, the outliers are documented and the analysis proceeds with the full dataset intact.
```{r}
#| label: outlier-treatment
#| fig-width: 10
#| fig-height: 4
# --- Define IQR boundaries for amount ---
Q1 <- quantile(df$amount, 0.25)
Q3 <- quantile(df$amount, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
cat("Amount IQR Boundaries:\n")
cat(" Lower Bound: $", round(lower_bound, 2), "\n")
cat(" Upper Bound: $", round(upper_bound, 2), "\n")
cat(" Outliers below lower bound:", sum(df$amount < lower_bound), "\n")
cat(" Outliers above upper bound:", sum(df$amount > upper_bound), "\n\n")
# --- Decision: Retain outliers ---
# We retain all records. Log-transformed amount created for
# modelling use later.
df <- df %>%
mutate(amount_log = log1p(amount))
cat("Log-transformed amount column added.\n")
cat("Final dataset dimensions:\n")
cat(" Rows:", nrow(df), "| Columns:", ncol(df), "\n")
# --- Visual: Original vs Log-transformed amount ---
p1 <- ggplot(df, aes(x = amount)) +
geom_histogram(fill = "#f0c040", color = NA, bins = 60, alpha = 0.85) +
labs(title = "Original Amount", x = "Amount ($)", y = "Count") +
theme_minimal(base_size = 11) +
theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
panel.background = element_rect(fill = "#12121a", color = NA),
panel.grid.major = element_line(color = "#2a2a3a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "#ffffff", face = "bold"),
axis.text = element_text(color = "#8a8a9a"),
axis.title = element_text(color = "#8a8a9a")
)
p2 <- ggplot(df, aes(x = amount_log)) +
geom_histogram(fill = "#00d4ff", color = NA, bins = 60, alpha = 0.85) +
labs(title = "Log-Transformed Amount", x = "Log(Amount + 1)", y = "Count") +
theme_minimal(base_size = 11) +
theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
panel.background = element_rect(fill = "#12121a", color = NA),
panel.grid.major = element_line(color = "#2a2a3a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "#ffffff", face = "bold"),
axis.text = element_text(color = "#8a8a9a"),
axis.title = element_text(color = "#8a8a9a")
)
# --- Display side by side ---
library(patchwork)
p1 + p2 +
plot_annotation(
title = "Figure: Amount Distribution (Original vs Log-Transformed)",
subtitle = "Log transformation compresses the right skew for modelling use",
theme = theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 13),
plot.subtitle = element_text(color = "#8a8a9a", size = 10)
)
)
```
### Interpretation
The IQR analysis calculates a lower bound of **-\$1,137.30** and an upper bound of **\$2,655.82** for the `amount` variable.
The negative lower bound is mathematically produced by the IQR formula but is practically meaningless in this context: a transaction amount cannot be negative. This tells us that the data is heavily right-skewed, causing the IQR formula to push the lower boundary into impossible territory. Unsurprisingly, zero outliers were detected below the lower bound.
All 2,320 outliers fall **above** the upper bound of \$2,655.82. These are high-value transactions where customers spent between \$2,655.82 and \$3,000.00 in a single purchase. In the context of an electronics retailer, this is entirely plausible. A customer purchasing a premium laptop, large-screen television, or multiple devices in one transaction could easily reach this spending level.
The original distribution histogram clearly confirms the right skew: the majority of transactions cluster at lower amounts, with a long tail stretching toward \$3,000.
The log-transformed distribution, however, does not produce the smooth bell curve typically expected. Instead it reveals a **multimodal pattern**, with multiple peaks across the distribution, suggesting that customer spending may naturally cluster around distinct price points. This is a meaningful business insight in itself: Blackwell's customers do not spend randomly across a continuous range, but appear to gravitate toward particular spending brackets, possibly corresponding to different product categories such as accessories, mid-range electronics, and premium devices.
While the log transformation reduces the right skew, the multimodal nature of the data means it will not perfectly satisfy the normality assumptions of some statistical models. This will be kept in mind during the modelling phase.
All **79,979 records are retained**. The dataset is now fully preprocessed and ready for exploratory data analysis.
## Preprocessing Summary
This section summarises every step taken during the preprocessing phase. A clean, documented audit trail of data preparation decisions is a hallmark of professional analytical work. It ensures transparency, reproducibility, and gives any reader full confidence in the integrity of the dataset before conclusions are drawn.
```{r}
#| label: preprocessing-summary
# --- Build the preprocessing audit table ---
preprocessing_log <- tibble(
Step = c(
"1. Data Inspection",
"2. Missing Values",
"3. Duplicate Records",
"4. Data Types Validation",
"5. Outlier Detection",
"6. Outlier Treatment"
),
Finding = c(
"80,000 rows, 5 columns. All variables present and accounted for.",
"Zero missing values detected across all 5 columns.",
"21 exact duplicate rows identified.",
"All variables correctly typed; channel & region as factors, age, items & amount as numeric.",
"2,320 outliers detected in amount (2.9%). Zero outliers in age and items.",
"Outliers retained — consistent with electronics retail context. Log transformation applied to amount."
),
Action_Taken = c(
"No action required.",
"No action required.",
"21 duplicate rows removed. Dataset reduced to 79,979 records.",
"No action required.",
"No action required. Outliers confirmed as genuine high-value transactions.",
"amount_log column added. All 79,979 records retained."
),
Status = c(
"✔ Complete",
"✔ Complete",
"✔ Complete",
"✔ Complete",
"✔ Complete",
"✔ Complete"
)
)
# --- Display as styled table ---
preprocessing_log %>%
kable(
col.names = c("Step", "Finding", "Action Taken", "Status"),
caption = "Table: Data Preprocessing Audit Log"
) %>%
kable_styling(full_width = TRUE) %>%
column_spec(1, bold = TRUE, color = "#00d4ff") %>%
column_spec(4, bold = TRUE, color = "#00d4ff")
```
## Clean Dataset, Ready for Analysis
The preprocessing phase is now complete. The following summarises the final state of the dataset entering the analysis phase:
| Attribute | Detail |
|-----------------|--------------------------|
| Total Records | 79,979 transactions |
| Total Variables | 6 (including amount_log) |
| Missing Values | None |
| Duplicates | Removed (21 rows) |
| Outliers | Retained & documented |
| Data Types | All correctly assigned |
The dataset is clean, validated, and fully documented. We now proceed to **Exploratory Data Analysis**, where we begin to answer the business questions posed by Danielle Sherman and Martin Goodrich.
# Exploratory Data Analysis
With a clean, validated dataset in hand, this analysis now turns to Exploratory Data Analysis (EDA), the process of visually and statistically examining the data to uncover patterns, relationships, and insights that directly address the business questions posed by Danielle Sherman and Martin Goodrich.
EDA is not simply about producing charts. It is about telling a story with data, understanding who Blackwell's customers are, how they behave, and what factors drive their purchasing decisions. Every visualisation in this section is intentional and directly linked to a specific business question from the brief.
This EDA is structured around five key themes:
1. **Regional Spending Patterns**: Which regions spend the most and least per transaction? This directly addresses Danielle's first question about whether geography influences spending behaviour.
2. **Items vs Amount**: Is there a relationship between the number of items purchased and the amount spent? A pattern here could inform decisions around bundling, promotions, and upselling.
3. **Customer Age & Spending**: How does customer age relate to transaction amount? This directly investigates Martin Goodrich's hypothesis that older customers spend more on electronics than younger customers.
4. **Purchase Channel Analysis**: How do online and in-store customers differ in age and spending behaviour? This addresses Martin's second hypothesis that in-store customers tend to be older than online customers, a claim with significant implications for website design and digital marketing strategy.
5. **Age Differences Across Regions**: Are there meaningful differences in the age profile of customers across the four regions? This addresses Danielle's question about whether region can be predicted from demographic data, and lays the foundation for the predictive modelling section that follows.
Together, these five analyses form a comprehensive picture of Blackwell's customer base, one that will directly inform the marketing, website, and business strategy decisions under consideration by the leadership team.
## Regional Spending Patterns
The first business question posed by Danielle is whether customers in different regions spend different amounts per transaction, and which regions spend the most and least. Understanding regional spending is critical for directing marketing budgets. It makes little sense to invest equally across all regions if some consistently generate higher transaction values than others.
Regional spending is examined here using three complementary views: a summary statistics table, a bar chart of mean spending by region, and a boxplot showing the full distribution of spending per region.
```{r}
#| label: regional-spending
#| fig-width: 10
#| fig-height: 5
# --- Summary statistics by region ---
regional_summary <- df %>%
group_by(region) %>%
summarise(
Transactions = n(),
Mean_Amount = round(mean(amount), 2),
Median_Amount = round(median(amount), 2),
SD_Amount = round(sd(amount), 2),
Min_Amount = round(min(amount), 2),
Max_Amount = round(max(amount), 2)
) %>%
arrange(desc(Mean_Amount))
# --- Display summary table ---
regional_summary %>%
kable(
col.names = c("Region", "Transactions", "Mean ($)",
"Median ($)", "Std Dev ($)",
"Min ($)", "Max ($)"),
caption = "Table: Regional Spending Summary Statistics"
) %>%
kable_styling(full_width = TRUE) %>%
column_spec(1, bold = TRUE, color = "#00d4ff") %>%
row_spec(1, background = "rgba(0, 212, 255, 0.08)")
# --- Region colour palette ---
region_palette <- c(
"North" = "#00d4ff",
"South" = "#7b61ff",
"East" = "#f0c040",
"West" = "#ff6b6b"
)
# --- Bar chart: Mean spending by region ---
p1 <- regional_summary %>%
ggplot(aes(x = reorder(region, Mean_Amount),
y = Mean_Amount,
fill = region)) +
geom_col(width = 0.6, alpha = 0.9) +
geom_text(aes(label = paste0("$", format(Mean_Amount, big.mark = ","))),
hjust = -0.1,
color = "#e8e8e8",
size = 3.8,
fontface = "bold") +
scale_fill_manual(values = region_palette) +
scale_y_continuous(
labels = dollar_format(),
expand = expansion(mult = c(0, 0.2))
) +
coord_flip() +
labs(
title = "Mean Transaction Amount by Region",
subtitle = "West region leads in average spend per transaction",
x = NULL,
y = "Mean Amount ($)"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
panel.background = element_rect(fill = "#12121a", color = NA),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_line(color = "#2a2a3a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 12),
plot.subtitle = element_text(color = "#8a8a9a", size = 10),
axis.text = element_text(color = "#8a8a9a"),
axis.title = element_text(color = "#8a8a9a"),
legend.position = "none"
)
# --- Boxplot: Full distribution by region ---
p2 <- df %>%
ggplot(aes(x = reorder(region, amount, FUN = mean),
y = amount,
fill = region)) +
geom_boxplot(outlier.colour = "#ff4d4d",
outlier.alpha = 0.2,
outlier.size = 0.6,
width = 0.5,
alpha = 0.85) +
scale_fill_manual(values = region_palette) +
scale_y_continuous(labels = dollar_format()) +
coord_flip() +
labs(
title = "Spending Distribution by Region",
subtitle = "Red dots indicate high-value transactions",
x = NULL,
y = "Transaction Amount ($)"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
panel.background = element_rect(fill = "#12121a", color = NA),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_line(color = "#2a2a3a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 12),
plot.subtitle = element_text(color = "#8a8a9a", size = 10),
axis.text = element_text(color = "#8a8a9a"),
axis.title = element_text(color = "#8a8a9a"),
legend.position = "none"
)
# --- Display side by side ---
p1 + p2 +
plot_annotation(
title = "Figure: Regional Spending Analysis",
theme = theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 13)
)
)
```
### Interpretation
Regional spending varies considerably across Blackwell's customer base. The West leads with a mean transaction value of **\$1,283.94**, followed by East (\$917.97), North (\$744.99), and South (\$252.10).
Most notably, the South region's maximum transaction value is just **\$499.94**, a ceiling no other region comes close to, suggesting Southern customers are exclusively purchasing lower-range products with no high-value purchasing behaviour present at all.
**Recommendation:** Premium campaigns should target the West and East. The South requires a value-oriented strategy, and further investigation into why high-value purchases are entirely absent from this region.
## Items vs Amount Relationship
A key question for Blackwell's sales strategy is whether customers who purchase more items in a single transaction also spend more money. If a strong positive relationship exists, it would support strategies such as bundling, cross-selling, and volume discounts to drive revenue. The relationship between items purchased and transaction amount is examined here using a scatter plot and a correlation analysis.
```{r}
#| label: items-vs-amount
#| fig-width: 10
#| fig-height: 5
# --- Pearson Correlation ---
correlation <- cor(df$items, df$amount, method = "pearson")
cat("Pearson Correlation (items vs amount):", round(correlation, 4), "\n")
# --- Scatter plot with trend line ---
p1 <- df %>%
ggplot(aes(x = items, y = amount)) +
geom_jitter(alpha = 0.08, color = "#00d4ff", size = 0.6, width = 0.3) +
geom_smooth(method = "lm", color = "#f0c040",
se = TRUE, linewidth = 1.2) +
scale_y_continuous(labels = dollar_format()) +
scale_x_continuous(breaks = 1:8) +
labs(
title = "Items Purchased vs Transaction Amount",
subtitle = paste0("Pearson r = ", round(correlation, 4),
" — negligible linear relationship"),
x = "Number of Items Purchased",
y = "Transaction Amount ($)"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
panel.background = element_rect(fill = "#12121a", color = NA),
panel.grid.major = element_line(color = "#2a2a3a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 12),
plot.subtitle = element_text(color = "#8a8a9a", size = 10),
axis.text = element_text(color = "#8a8a9a"),
axis.title = element_text(color = "#8a8a9a")
)
# --- Boxplot: Amount distribution per item count ---
p2 <- df %>%
ggplot(aes(x = factor(items), y = amount, fill = factor(items))) +
geom_boxplot(outlier.colour = "#ff4d4d",
outlier.alpha = 0.2,
outlier.size = 0.6,
width = 0.6,
alpha = 0.85) +
scale_fill_manual(values = c(
"1" = "#00d4ff", "2" = "#7b61ff", "3" = "#f0c040",
"4" = "#ff6b6b", "5" = "#00ff9d", "6" = "#ff9d00",
"7" = "#ff61d8", "8" = "#61ffed"
)) +
scale_y_continuous(labels = dollar_format()) +
labs(
title = "Amount Distribution by Items Purchased",
subtitle = "Spending spread is consistent regardless of item count",
x = "Number of Items Purchased",
y = "Transaction Amount ($)"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
panel.background = element_rect(fill = "#12121a", color = NA),
panel.grid.major = element_line(color = "#2a2a3a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 12),
plot.subtitle = element_text(color = "#8a8a9a", size = 10),
axis.text = element_text(color = "#8a8a9a"),
axis.title = element_text(color = "#8a8a9a"),
legend.position = "none"
)
# --- Display side by side ---
p1 + p2 +
plot_annotation(
title = "Figure: Relationship Between Items Purchased & Amount Spent",
theme = theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 13)
)
)
```
### Interpretation
The Pearson correlation between items purchased and amount spent is virtually zero (r = 0.0004), confirming that **no meaningful linear relationship exists** between the two variables. The scatter plot and boxplots reinforce this: spending is widely and randomly distributed regardless of how many items a customer buys.
**Recommendation:** Bundling and volume-based pricing strategies are unlikely to drive higher revenue. Spending appears to be driven by product type and price point rather than quantity purchased.
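One caveat: the Pearson statistic only measures *linear* association, so a near-zero r could in principle hide a monotonic but non-linear pattern. A Spearman rank correlation guards against this. The sketch below is a minimal, self-contained illustration on simulated stand-in data (the `sim` data frame is hypothetical; with the real data, `sim` would simply be replaced by `df`):

```r
# Robustness check: Spearman rank correlation detects monotonic
# (not only linear) association between items and amount.
set.seed(1234)

# Simulated stand-in for df: item counts independent of spend
sim <- data.frame(
  items  = sample(1:8, 1000, replace = TRUE),
  amount = runif(1000, 5, 3000)
)

pearson  <- cor(sim$items, sim$amount, method = "pearson")
spearman <- cor(sim$items, sim$amount, method = "spearman")

cat("Pearson: ", round(pearson, 4),
    " Spearman:", round(spearman, 4), "\n")
```

If both coefficients are near zero, no monotonic relationship exists either, which strengthens the conclusion that quantity does not drive spend.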
## Customer Age & Spending
Martin Goodrich hypothesises that older customers spend more on electronics than younger customers. This is a critical claim. If true, it would justify design changes to the website and targeted marketing campaigns aimed at older demographics. If false, such investments could be misdirected and costly.
This hypothesis is tested directly using a scatter plot of age against transaction amount, supported by a correlation analysis and a mean spending summary across age groups.
```{r}
#| label: age-vs-spending
#| fig-width: 10
#| fig-height: 5
# --- Pearson Correlation: Age vs Amount ---
age_corr <- cor(df$age, df$amount, method = "pearson")
cat("Pearson Correlation (age vs amount):", round(age_corr, 4), "\n")
# --- Create three age groups ---
df <- df %>%
mutate(age_group = cut(age,
breaks = c(17, 39, 59, 85),
labels = c("Young Adults (18-39)",
"Middle-Aged (40-59)",
"Senior (60+)")
))
# --- Mean spending by age group ---
age_group_summary <- df %>%
group_by(age_group) %>%
summarise(
Transactions = n(),
Mean_Amount = round(mean(amount), 2),
Median_Amount = round(median(amount), 2)
)
# --- Display summary table ---
age_group_summary %>%
kable(
col.names = c("Age Group", "Transactions",
"Mean ($)", "Median ($)"),
caption = "Table: Mean Spending by Age Group"
) %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE, color = "#00d4ff")
# --- Scatter plot: Age vs Amount ---
p1 <- df %>%
ggplot(aes(x = age, y = amount, color = age_group)) +
geom_point(alpha = 0.06, size = 0.7) +
geom_smooth(method = "lm", color = "#f0c040",
se = TRUE, linewidth = 1.2) +
scale_color_manual(values = c(
"Young Adults (18-39)" = "#00d4ff",
"Middle-Aged (40-59)" = "#7b61ff",
"Senior (60+)" = "#ff6b6b"
)) +
scale_y_continuous(labels = dollar_format()) +
labs(
title = "Customer Age vs Transaction Amount",
subtitle = paste0("Pearson r = ", round(age_corr, 4),
" — negligible relationship"),
x = "Customer Age",
y = "Transaction Amount ($)",
color = "Age Group"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
panel.background = element_rect(fill = "#12121a", color = NA),
panel.grid.major = element_line(color = "#2a2a3a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 12),
plot.subtitle = element_text(color = "#8a8a9a", size = 10),
axis.text = element_text(color = "#8a8a9a"),
axis.title = element_text(color = "#8a8a9a"),
legend.title = element_text(color = "#8a8a9a", size = 9),
legend.text = element_text(color = "#8a8a9a", size = 8),
legend.background = element_rect(fill = "#12121a", color = NA)
)
# --- Bar chart: Mean spending by age group ---
p2 <- age_group_summary %>%
ggplot(aes(x = age_group, y = Mean_Amount, fill = age_group)) +
geom_col(width = 0.5, alpha = 0.9) +
geom_text(aes(label = paste0("$", format(Mean_Amount, big.mark = ","))),
vjust = -0.5,
color = "#e8e8e8",
size = 3.8,
fontface = "bold") +
scale_fill_manual(values = c(
"Young Adults (18-39)" = "#00d4ff",
"Middle-Aged (40-59)" = "#7b61ff",
"Senior (60+)" = "#ff6b6b"
)) +
scale_y_continuous(
labels = dollar_format(),
expand = expansion(mult = c(0, 0.15))
) +
labs(
title = "Mean Spending by Age Group",
subtitle = "No consistent upward trend with increasing age",
x = NULL,
y = "Mean Amount ($)"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
panel.background = element_rect(fill = "#12121a", color = NA),
panel.grid.major = element_line(color = "#2a2a3a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 12),
plot.subtitle = element_text(color = "#8a8a9a", size = 10),
axis.text = element_text(color = "#8a8a9a"),
axis.title = element_text(color = "#8a8a9a"),
legend.position = "none"
)
# --- Display side by side ---
p1 + p2 +
plot_annotation(
title = "Figure: Customer Age & Spending Behaviour",
theme = theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 13)
)
)
```
### Interpretation: Martin's Hypothesis #2
The data directly **contradicts** Martin's hypothesis. Rather than older customers spending more, the analysis reveals the opposite: a **negative correlation (r = -0.282)** between age and transaction amount, meaning older customers actually tend to spend **less** than younger customers.
The age group summary confirms this clearly:
| Age Group | Mean Spend |
|----------------------|------------|
| Young Adults (18-39) | \$991.08 |
| Middle-Aged (40-59) | \$869.77 |
| Senior (60+) | \$469.83 |
Spending declines consistently as age increases — Young Adults spend on average more than **twice as much** as Senior customers (\$991.08 vs \$469.83).
**Recommendation:** Martin's proposed website design changes and marketing activities targeting older customers as higher spenders are not supported by the data. If anything, marketing efforts aimed at driving higher transaction values should focus on **younger demographics**, particularly the 18-39 age group, which consistently generates the highest average spend.
## Purchase Channel Analysis
Martin Goodrich's first hypothesis is that customers who shop in-store are older than customers who shop online. This claim has significant implications. If true, it would suggest the website should be redesigned to appeal to younger shoppers, while in-store experiences are tailored to older customers.
This hypothesis is tested directly by comparing the age distribution of online and in-store customers, supported by a statistical test to determine whether any observed difference is meaningful or simply due to chance.
```{r}
#| label: channel-analysis
#| fig-width: 10
#| fig-height: 5
# --- Summary statistics by channel ---
channel_summary <- df %>%
group_by(channel) %>%
summarise(
Transactions = n(),
Mean_Age = round(mean(age), 1),
Median_Age = round(median(age), 1),
Mean_Amount = round(mean(amount), 2),
Median_Amount = round(median(amount), 2)
)
# --- Display summary table ---
channel_summary %>%
kable(
col.names = c("Channel", "Transactions", "Mean Age",
"Median Age", "Mean Amount ($)",
"Median Amount ($)"),
caption = "Table: Customer Profile by Purchase Channel"
) %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE, color = "#00d4ff")
# --- T-test: Is the age difference statistically significant? ---
online_ages <- df %>% filter(channel == "Online") %>% pull(age)
instore_ages <- df %>% filter(channel == "In-Store") %>% pull(age)
ttest_result <- t.test(instore_ages, online_ages)
cat("\nWelch Two-Sample T-Test: Age by Channel\n")
cat(" In-Store Mean Age:", round(mean(instore_ages), 2), "\n")
cat(" Online Mean Age: ", round(mean(online_ages), 2), "\n")
cat(" t-statistic: ", round(ttest_result$statistic, 4), "\n")
cat(" p-value: ", format.pval(ttest_result$p.value, digits = 4), "\n")
cat(" Conclusion: ",
ifelse(ttest_result$p.value < 0.05,
"Statistically significant difference (p < 0.05)",
"No statistically significant difference (p >= 0.05)"), "\n")
# --- Channel colour palette ---
channel_palette <- c("Online" = "#00d4ff", "In-Store" = "#ff6b6b")
# --- Density plot: Age distribution by channel ---
p1 <- df %>%
ggplot(aes(x = age, fill = channel, color = channel)) +
geom_density(alpha = 0.3, linewidth = 1) +
scale_fill_manual(values = channel_palette) +
scale_color_manual(values = channel_palette) +
labs(
title = "Age Distribution by Purchase Channel",
subtitle = "Online vs In-Store customer age profiles",
x = "Customer Age",
y = "Density",
fill = "Channel",
color = "Channel"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
panel.background = element_rect(fill = "#12121a", color = NA),
panel.grid.major = element_line(color = "#2a2a3a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 12),
plot.subtitle = element_text(color = "#8a8a9a", size = 10),
axis.text = element_text(color = "#8a8a9a"),
axis.title = element_text(color = "#8a8a9a"),
legend.background = element_rect(fill = "#12121a", color = NA),
legend.title = element_text(color = "#8a8a9a", size = 9),
legend.text = element_text(color = "#8a8a9a", size = 8)
)
# --- Boxplot: Age by channel ---
p2 <- df %>%
ggplot(aes(x = channel, y = age, fill = channel)) +
geom_boxplot(outlier.colour = "#ff4d4d",
outlier.alpha = 0.3,
outlier.size = 0.8,
width = 0.5,
alpha = 0.85) +
geom_jitter(alpha = 0.02, width = 0.2,
color = "#ffffff", size = 0.4) +
scale_fill_manual(values = channel_palette) +
labs(
title = "Age Spread by Purchase Channel",
subtitle = "Median age comparison between channels",
x = NULL,
y = "Customer Age"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
panel.background = element_rect(fill = "#12121a", color = NA),
panel.grid.major = element_line(color = "#2a2a3a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 12),
plot.subtitle = element_text(color = "#8a8a9a", size = 10),
axis.text = element_text(color = "#8a8a9a"),
axis.title = element_text(color = "#8a8a9a"),
legend.position = "none"
)
# --- Display side by side ---
p1 + p2 +
plot_annotation(
title = "Figure: Age Analysis by Purchase Channel",
theme = theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 13)
)
)
```
### Interpretation: Martin's Hypothesis #1
The data confirms that a real age difference between channels exists, but in the **opposite direction** to what Martin expected.
Online customers are actually **older** (mean age 48.56) than in-store customers (mean age 42.96), contradicting Martin's claim that in-store shoppers are older. The Welch t-test confirms this difference is **statistically significant** (t = -51.21, p \< 0.0001), meaning it is extremely unlikely to be due to chance across 79,979 transactions.
Additionally, in-store customers spend considerably more on average (**\$688.18**) compared to online customers (**\$441.89**), despite being younger, which further contradicts Martin's assumption that older customers are the higher spenders.
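With roughly 80,000 records, almost any difference will reach statistical significance, so it is worth checking that the age gap is practically meaningful as well. Cohen's d is a standard effect-size measure for a two-group mean difference. The sketch below is self-contained: it simulates stand-in samples using the report's means, while the standard deviation of 14 is an assumption for illustration (with the real data, `online` and `instore` would be `online_ages` and `instore_ages` from the chunk above):

```r
# Effect size: with ~80,000 records almost any difference is
# "significant", so Cohen's d checks practical importance.
cohens_d <- function(x, y) {
  nx <- length(x); ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) /
                      (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Simulated stand-in samples; sd = 14 is an assumed value
set.seed(1234)
online  <- rnorm(40000, mean = 48.56, sd = 14)
instore <- rnorm(40000, mean = 42.96, sd = 14)

d <- cohens_d(online, instore)
cat("Cohen's d:", round(d, 2), "\n")
```

A d in the region of 0.4 would indicate a small-to-medium effect: the age gap is not just statistically detectable but large enough to matter for channel strategy.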
**Recommendation:** The evidence does not support redesigning the website to attract older buyers: older customers are already the primary online shoppers. Instead, attention should focus on understanding why younger in-store customers are generating higher transaction values, and whether that behaviour can be replicated through the online channel to drive eCommerce revenue.
## Age Differences Across Regions
Danielle's second email raises an important question: are there meaningful differences in the age profile of customers across the four regions? Understanding whether age varies by region is valuable for two reasons: it informs region-specific marketing strategies, and it lays the analytical foundation for the predictive modelling section that follows, where age and other demographic variables are used to predict customer behaviour.
Age distributions across the four regions are examined here using a combination of summary statistics, density plots, and boxplots.
```{r}
#| label: age-by-region
#| fig-width: 10
#| fig-height: 5
# --- Summary statistics: Age by region ---
region_age_summary <- df %>%
group_by(region) %>%
summarise(
Transactions = n(),
Mean_Age = round(mean(age), 1),
Median_Age = round(median(age), 1),
SD_Age = round(sd(age), 1),
Min_Age = min(age),
Max_Age = max(age)
) %>%
arrange(desc(Mean_Age))
# --- Display summary table ---
region_age_summary %>%
kable(
col.names = c("Region", "Transactions", "Mean Age",
"Median Age", "Std Dev",
"Min Age", "Max Age"),
caption = "Table: Customer Age Profile by Region"
) %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE, color = "#00d4ff") %>%
row_spec(1, background = "rgba(0, 212, 255, 0.08)")
# --- ANOVA: Is age difference across regions significant? ---
anova_result <- aov(age ~ region, data = df)
anova_summary <- summary(anova_result)
p_value <- anova_summary[[1]][["Pr(>F)"]][1]
cat("\nOne-Way ANOVA: Age by Region\n")
cat(" F-statistic:", round(anova_summary[[1]][["F value"]][1], 2), "\n")
cat(" p-value:    ", format.pval(p_value, digits = 4), "\n")
cat(" Conclusion: ",
ifelse(p_value < 0.05,
"Statistically significant age differences across regions (p < 0.05)",
"No statistically significant difference (p >= 0.05)"), "\n")
# --- Region colour palette ---
region_palette <- c(
"North" = "#00d4ff",
"South" = "#7b61ff",
"East" = "#f0c040",
"West" = "#ff6b6b"
)
# --- Density plot: Age distribution by region ---
p1 <- df %>%
ggplot(aes(x = age, fill = region, color = region)) +
geom_density(alpha = 0.25, linewidth = 1) +
scale_fill_manual(values = region_palette) +
scale_color_manual(values = region_palette) +
labs(
title = "Age Distribution by Region",
subtitle = "Distinct age profiles visible across regions",
x = "Customer Age",
y = "Density",
fill = "Region",
color = "Region"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
panel.background = element_rect(fill = "#12121a", color = NA),
panel.grid.major = element_line(color = "#2a2a3a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 12),
plot.subtitle = element_text(color = "#8a8a9a", size = 10),
axis.text = element_text(color = "#8a8a9a"),
axis.title = element_text(color = "#8a8a9a"),
legend.background = element_rect(fill = "#12121a", color = NA),
legend.title = element_text(color = "#8a8a9a", size = 9),
legend.text = element_text(color = "#8a8a9a", size = 8)
)
# --- Boxplot: Age spread by region ---
p2 <- df %>%
ggplot(aes(x = reorder(region, age, FUN = mean),
y = age, fill = region)) +
geom_boxplot(outlier.colour = "#ff4d4d",
outlier.alpha = 0.3,
outlier.size = 0.8,
width = 0.5,
alpha = 0.85) +
geom_jitter(alpha = 0.015, width = 0.2,
color = "#ffffff", size = 0.4) +
scale_fill_manual(values = region_palette) +
coord_flip() +
labs(
title = "Age Spread by Region",
subtitle = "Ordered by mean age — South notably older",
x = NULL,
y = "Customer Age"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
panel.background = element_rect(fill = "#12121a", color = NA),
panel.grid.major = element_line(color = "#2a2a3a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 12),
plot.subtitle = element_text(color = "#8a8a9a", size = 10),
axis.text = element_text(color = "#8a8a9a"),
axis.title = element_text(color = "#8a8a9a"),
legend.position = "none"
)
# --- Display side by side ---
p1 + p2 +
plot_annotation(
title = "Figure: Age Distribution Across Regions",
theme = theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 13)
)
)
```
### Interpretation
Significant age differences exist across Blackwell's four regions. The mean age ranking from oldest to youngest is:
1. **South**: 56.6 years (oldest)
2. **East**: 45.6 years
3. **North**: 43.7 years
4. **West**: 38.8 years (youngest)
The South region stands out strikingly: its customers are on average **17.8 years older** than West region customers. Combined with the earlier finding that the South has the lowest spending ceiling (\$499.94 maximum), this paints a consistent picture: the South serves an older customer base with more conservative spending.
The West, by contrast, has the youngest customer profile and the highest transaction values, reinforcing the earlier finding that younger customers tend to spend more.
The One-Way ANOVA confirms these differences are **statistically significant** (F = 6,139.42, p \< 0.0001). The age gaps across regions are not due to chance.
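A significant ANOVA only establishes that *some* regional difference exists; it does not say which pairs of regions differ. Tukey's HSD test on the fitted `aov` object answers that, with p-values adjusted for multiple comparisons. The sketch below runs on simulated stand-in data built from the report's regional means (the standard deviation of 12 and the group size of 500 are assumptions for illustration; with the real data, `sim` would be replaced by `df` and `fit` by `anova_result`):

```r
# Post-hoc check: ANOVA says regions differ; Tukey's HSD says
# which pairs differ and by how much (with adjusted p-values).
set.seed(1234)
sim <- data.frame(
  region = factor(rep(c("North", "South", "East", "West"), each = 500)),
  age    = c(rnorm(500, 43.7, 12), rnorm(500, 56.6, 12),
             rnorm(500, 45.6, 12), rnorm(500, 38.8, 12))
)

fit   <- aov(age ~ region, data = sim)
tukey <- TukeyHSD(fit)
print(round(tukey$region, 3))
# Each row is a pairwise mean difference with a 95% CI and an
# adjusted p-value; South stands well apart from every other region.
```

Running this on the real data would quantify each pairwise gap directly, for example confirming that the South sits well apart from every other region while North and East are much closer together.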
**Recommendation:** Regional marketing strategies should account for these demographic differences. The South may respond better to trust-building, value-oriented messaging suited to an older audience. The West and East, with younger and higher-spending customers, are better targets for premium and technology-forward campaigns.
# Predictive Modelling
The exploratory analysis has revealed clear patterns in Blackwell's customer data. This section builds on those findings by developing predictive models that can anticipate customer behaviour, moving beyond description into prediction.
Three prediction tasks are addressed:
1. **Purchase Channel Prediction**: Can a customer's likelihood of shopping online or in-store be predicted from their demographic and transaction data? *(Binary Classification)*
2. **Customer Age Prediction**: Can the age of a customer be estimated from their purchasing behaviour and region? *(Regression)*
3. **Region Prediction**: Can the region a customer belongs to be predicted from their demographic profile? *(Multi-Class Classification)*
For each task, three models are trained and evaluated on a held-out test set, and the best performing model is selected for final interpretation and business recommendations.
## Modelling Approach
All three tasks follow a consistent workflow:
- **Data splitting**: 80% training, 20% test set
- **Reproducibility**: a fixed random seed (set.seed(1234)) is used throughout to ensure results are fully reproducible
- **Model training**: three candidate models trained per task
- **Model selection**: best model chosen based on evaluation metrics on the test set
- **Interpretation**: findings translated into actionable business recommendations
## Features Available for Modelling
| Task | Target Variable | Predictor Variables |
|--------------------|-----------------|--------------------------------|
| Channel Prediction | channel | age, items, amount, region |
| Age Prediction | age | channel, items, amount, region |
| Region Prediction | region | age, channel, items, amount |
```{r}
#| label: modelling-setup
# --- Set seed for reproducibility ---
set.seed(1234)
# --- Define reusable train control ---
# 5-fold cross validation during training
train_control <- trainControl(
method = "cv",
number = 5,
savePredictions = "final",
verboseIter = FALSE
)
cat("Modelling setup complete.\n")
cat("Random seed set to 1234 for full reproducibility.\n")
cat("5-fold cross-validation configured for all models.\n")
```
A 5-fold cross-validation strategy is applied during training. This means the training set is divided into five equal folds. The model trains on four folds and validates on the fifth, rotating until every fold has served as the validation set. This produces a more reliable estimate of model performance than a single train/validate split, and reduces the risk of overfitting.
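The fold rotation described above can be sketched by hand; it is exactly the procedure that `trainControl(method = "cv", number = 5)` automates. A minimal, self-contained illustration on simulated data (the variables `x`, `y`, and the linear model here are illustrative only, not the report's actual models):

```r
# Manual sketch of 5-fold cross-validation: the training data is
# split into 5 folds; each fold takes one turn as the validation set.
set.seed(1234)
n <- 100
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 0.3)
folds <- sample(rep(1:5, length.out = n))  # random fold labels

rmse_per_fold <- sapply(1:5, function(k) {
  train <- data.frame(x = x[folds != k], y = y[folds != k])
  test  <- data.frame(x = x[folds == k], y = y[folds == k])
  fit   <- lm(y ~ x, data = train)        # train on four folds
  pred  <- predict(fit, newdata = test)   # validate on the fifth
  sqrt(mean((test$y - pred)^2))
})

cat("Cross-validated RMSE:", round(mean(rmse_per_fold), 3), "\n")
```

Averaging the metric across the five folds gives a performance estimate that uses every training observation for validation exactly once, which is why it is more stable than a single hold-out split.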
# Purchase Channel Prediction
The first modelling task investigates whether a customer's purchase channel, **online or in-store**, can be predicted from their demographic and transaction data. This is a binary classification problem, where the target variable has two possible outcomes: Online or In-Store.
Accurately predicting purchase channel would allow Blackwell to proactively identify customers likely to shop online versus in-store, enabling more targeted marketing, personalised communications, and smarter inventory decisions across channels.
**Candidate predictor variables:** age, items, amount, region
## Data Splitting
```{r}
#| label: channel-split
# --- 80/20 Train/Test Split ---
# Re-set the seed immediately before partitioning so the split is
# reproducible regardless of chunk execution order
set.seed(1234)
train_index <- createDataPartition(df$channel, p = 0.8, list = FALSE)
train_channel <- df[train_index, ]
test_channel <- df[-train_index, ]
cat("Channel Prediction: Data Split Summary:\n")
cat(" Training set:", nrow(train_channel), "rows\n")
cat(" Test set: ", nrow(test_channel), "rows\n\n")
# --- Verify class balance in split ---
cat("Training set channel distribution:\n")
print(prop.table(table(train_channel$channel)))
cat("\nTest set channel distribution:\n")
print(prop.table(table(test_channel$channel)))
```
## Feature Selection
```{r}
#| label: channel-rfe
#| cache: true
# --- Recursive Feature Elimination ---
# Using Random Forest as the estimator for feature importance
rfe_control <- rfeControl(
functions = rfFuncs,
method = "cv",
number = 5,
verbose = FALSE
)
# --- Prepare features and target ---
# Convert region to numeric for RFE compatibility
train_channel_rfe <- train_channel %>%
mutate(
region_num = as.numeric(region),
channel_num = as.numeric(channel)
)
rfe_features <- train_channel_rfe %>%
select(age, items, amount, region_num)
rfe_target <- train_channel_rfe$channel
# --- Run RFE ---
set.seed(1234)
rfe_result_channel <- rfe(
x = rfe_features,
y = rfe_target,
sizes = c(1, 2, 3, 4),
rfeControl = rfe_control
)
# --- Results ---
cat("Optimal Features Selected:\n")
print(predictors(rfe_result_channel))
cat("\nRFE Performance Summary:\n")
print(rfe_result_channel)
# --- Extract RFE results for custom plot ---
rfe_results_df <- rfe_result_channel$results
# --- Custom readable RFE plot ---
ggplot(rfe_results_df, aes(x = Variables, y = Accuracy)) +
geom_line(color = "#00d4ff", linewidth = 1.2) +
geom_point(aes(size = Accuracy),
color = "#f0c040", fill = "#f0c040",
shape = 21, stroke = 1.5) +
geom_text(aes(label = round(Accuracy, 4)),
vjust = -1.2,
color = "#ffffff",
size = 3.8,
fontface = "bold") +
scale_x_continuous(breaks = c(1, 2, 3, 4)) +
scale_y_continuous(
limits = c(
min(rfe_results_df$Accuracy) - 0.01,
max(rfe_results_df$Accuracy) + 0.02
),
labels = scales::percent_format(accuracy = 0.1)
) +
labs(
title = "Figure: RFE Feature Selection for Channel Prediction",
subtitle = "Cross-validated accuracy by number of features retained",
x = "Number of Features",
y = "Accuracy (Cross-Validated)"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#0a0a0f", color = NA),
panel.background = element_rect(fill = "#12121a", color = NA),
panel.grid.major = element_line(color = "#2a2a3a", linetype = "dashed"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 12),
plot.subtitle = element_text(color = "#8a8a9a", size = 10),
axis.text = element_text(color = "#e8e8e8", size = 11),
axis.title = element_text(color = "#e8e8e8", size = 11),
legend.position = "none"
)
```
## Feature Selection Results
Recursive Feature Elimination identified **three optimal features** for predicting purchase channel: **region, amount, and age.**
The `items` variable was eliminated, contributing no meaningful predictive power, consistent with the EDA finding that the number of items purchased has no relationship with the amount spent.
Cross-validated accuracy improves progressively from 74.77% with one feature to **88.72% with three features**. Adding the fourth feature (items) actually reduces accuracy slightly to 88.29%, confirming the three-feature model as optimal.
All three models will be trained using region, amount, and age as predictor variables.
## Model Training
```{r}
#| label: channel-models
#| cache: true
# --- Prepare train and test sets with RFE selected features ---
train_x <- train_channel %>% select(age, amount, region)
test_x <- test_channel %>% select(age, amount, region)
train_y <- train_channel$channel
test_y <- test_channel$channel
cat("Training models with features: age, amount, region\n\n")
# --- Model 1: Logistic Regression ---
set.seed(1234)
model_lr_channel <- train(
x = train_x,
y = train_y,
method = "glm",
family = "binomial",
trControl = train_control,
preProcess = c("center", "scale")
)
cat("Logistic Regression — Trained\n")
# --- Model 2: Random Forest ---
set.seed(1234)
model_rf_channel <- train(
x = train_x,
y = train_y,
method = "rf",
trControl = train_control,
preProcess = c("center", "scale"),
tuneLength = 3
)
cat("Random Forest — Trained\n")
# --- Model 3: Decision Tree ---
set.seed(1234)
model_dt_channel <- train(
x = train_x,
y = train_y,
method = "rpart",
trControl = train_control,
preProcess = c("center", "scale"),
tuneLength = 10
)
cat("Decision Tree — Trained\n")
cat("\nAll three models trained successfully.\n")
```
## Channel Model Comparison
```{r}
#| label: channel-comparison
# --- Predictions on test set ---
pred_lr_channel <- predict(model_lr_channel, test_x)
pred_rf_channel <- predict(model_rf_channel, test_x)
pred_dt_channel <- predict(model_dt_channel, test_x)
# --- Confusion matrices ---
cm_lr <- confusionMatrix(pred_lr_channel, test_y)
cm_rf <- confusionMatrix(pred_rf_channel, test_y)
cm_dt <- confusionMatrix(pred_dt_channel, test_y)
# --- Model comparison table ---
model_comparison_channel <- tibble(
Model = c("Logistic Regression",
"Random Forest",
"Decision Tree"),
Accuracy = c(cm_lr$overall["Accuracy"],
cm_rf$overall["Accuracy"],
cm_dt$overall["Accuracy"]),
Kappa = c(cm_lr$overall["Kappa"],
cm_rf$overall["Kappa"],
cm_dt$overall["Kappa"]),
Sensitivity = c(cm_lr$byClass["Sensitivity"],
cm_rf$byClass["Sensitivity"],
cm_dt$byClass["Sensitivity"]),
Specificity = c(cm_lr$byClass["Specificity"],
cm_rf$byClass["Specificity"],
cm_dt$byClass["Specificity"])
) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
arrange(desc(Accuracy))
# --- Display comparison table ---
model_comparison_channel %>%
kable(
col.names = c("Model", "Accuracy", "Kappa",
"Sensitivity", "Specificity"),
caption = "Table: Model Comparison for Channel Prediction"
) %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE, color = "#00d4ff") %>%
row_spec(1, background = "rgba(0, 212, 255, 0.08)")
# --- Best model statement ---
best_model <- model_comparison_channel$Model[1]
best_accuracy <- round(model_comparison_channel$Accuracy[1] * 100, 2)
best_kappa <- model_comparison_channel$Kappa[1]
cat("Best Performing Model:", best_model, "\n")
cat("Test Set Accuracy: ", best_accuracy, "%\n")
cat("Kappa Statistic: ", best_kappa, "\n")
```
## Purchase Channel Best Model
```{r}
#| label: channel-dt-evaluation
# --- Confusion Matrix: Decision Tree ---
cat("Confusion Matrix — Decision Tree\n")
cat("==================================\n")
print(cm_dt$table)
# --- Actual vs Predicted comparison table ---
dt_results <- tibble(
Actual = test_y,
Predicted = pred_dt_channel,
Correct = test_y == pred_dt_channel
)
# --- Summary of correct vs incorrect predictions ---
dt_summary <- dt_results %>%
group_by(Actual, Predicted) %>%
summarise(Count = n(), .groups = "drop") %>%
mutate(
Result = ifelse(Actual == Predicted,
"Correct", "Incorrect")
)
# --- Display actual vs predicted table ---
dt_summary %>%
kable(
col.names = c("Actual", "Predicted",
"Count", "Result"),
caption = "Table: Actual vs Predicted (Decision Tree)"
) %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE, color = "#00d4ff") %>%
column_spec(4, bold = TRUE,
color = ifelse(dt_summary$Result == "Correct",
"#00ff9d", "#ff6b6b"))
# --- Overall prediction accuracy breakdown ---
cat("\nPrediction Breakdown:\n")
cat(" Correctly classified: ", sum(dt_results$Correct), "\n")
cat(" Incorrectly classified:", sum(!dt_results$Correct), "\n")
cat(" Overall Accuracy: ",
round(mean(dt_results$Correct) * 100, 2), "%\n")
```
## Saving the Best Channel Model
```{r}
#| label: save-channel-model
# --- Save best model for channel prediction ---
saveRDS(model_dt_channel, "best_model_channel.rds")
cat("Best model for channel prediction saved successfully.\n")
cat("Model: Decision Tree\n")
cat("File: best_model_channel.rds\n")
```
## Interpretation: Channel Prediction
The Decision Tree outperformed both Random Forest and Logistic Regression, achieving a test set accuracy of **88.82%** and a Kappa of **0.7764**, indicating strong agreement between predicted and actual channel classifications beyond chance.
The actual vs predicted table reveals an important asymmetry in model performance:
- **Online customers** are predicted with near-perfect accuracy: only **1 Online customer** was misclassified as In-Store.
- **In-Store customers** are harder to predict: **1,744 In-Store customers** were incorrectly classified as Online.
This suggests the model has learned a very strong signal for identifying Online customers, likely driven by age and region, but finds In-Store customers more difficult to distinguish, possibly because their demographic profile overlaps more with Online customers in certain regions.
Overall, 14,250 out of 15,995 customers were correctly classified, with only 1,745 misclassifications.
**Recommendation:** The Decision Tree model provides a reliable tool for predicting whether a customer will shop online or in-store. Blackwell can use this model to proactively segment new customers and personalise marketing communications by channel before a purchase is even made. The model's near-perfect identification of Online customers is particularly valuable for targeted digital marketing campaigns.
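To illustrate how the saved model file could be used to score new customers, the sketch below fits a toy `rpart` tree on simulated data and round-trips it through `saveRDS()`/`readRDS()`, the same mechanism used above for `best_model_channel.rds`. The toy data, its deterministic rule, and the two hypothetical customers are all invented for the example; the production model would be scored the same way via `predict()`.

```r
library(rpart)

# Toy stand-in data: the real model was trained by caret on the full
# Blackwell dataset; this only demonstrates the save -> load -> score cycle.
set.seed(1234)
toy <- data.frame(
  age    = sample(18:80, 500, replace = TRUE),
  amount = runif(500, 10, 2000),
  region = factor(sample(c("North", "South", "East", "West"), 500, replace = TRUE))
)
toy$channel <- factor(ifelse(toy$age > 45, "Online", "In-Store"))

fit <- rpart(channel ~ age + amount + region, data = toy, method = "class")

# Persist and reload, as the report does with best_model_channel.rds
path <- file.path(tempdir(), "toy_channel_model.rds")
saveRDS(fit, path)
reloaded <- readRDS(path)

# Score two hypothetical new customers
new_customers <- data.frame(
  age    = c(29, 61),
  amount = c(1450, 320),
  region = factor(c("West", "South"), levels = levels(toy$region))
)
predict(reloaded, new_customers, type = "class")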
# Customer Age Prediction
The second modelling task investigates whether the age of a customer can be predicted from their purchasing behaviour and demographic data. This is a regression problem; the target variable, age, is a continuous numeric value rather than a category.
Predicting customer age is valuable for Blackwell because age data may not always be available for every customer. A reliable predictive model would allow the business to estimate the age profile of new customers from transaction data alone, enabling more targeted marketing without requiring customers to explicitly provide their age.
It is worth acknowledging upfront that the EDA revealed only a weak negative correlation between age and amount spent (r = -0.282), and that age varies significantly by region. Predicting age from the available variables is therefore expected to be a challenging task, and this is reflected honestly in the results.
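To see why r = -0.282 makes this task hard: for a single-predictor linear model, R² equals the squared Pearson correlation, capping the variance explained at roughly 8%. The simulation below (synthetic data, not the Blackwell dataset) demonstrates this ceiling; the models later in this section exceed it only because they combine several predictors.

```r
# A weak correlation caps a one-predictor linear model's R-squared at r^2.
r <- -0.282
r^2  # ~0.0795, i.e. at most ~8% of variance explained

# Demonstration on simulated data constructed to have correlation ~ r
set.seed(1234)
n      <- 10000
age    <- rnorm(n, mean = 45, sd = 14)
amount <- as.numeric(r * scale(age) + sqrt(1 - r^2) * rnorm(n))

cor(age, amount)                      # close to -0.282 by construction
summary(lm(age ~ amount))$r.squared   # close to r^2 = 0.0795
```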
**Candidate predictor variables:** channel, items, amount, region
## Data Splitting
```{r}
#| label: age-split
# --- 80/20 Train/Test Split ---
set.seed(1234)
train_index_age <- createDataPartition(df$age, p = 0.8, list = FALSE)
train_age <- df[train_index_age, ]
test_age <- df[-train_index_age, ]
cat("Age Prediction — Data Split Summary:\n")
cat(" Training set:", nrow(train_age), "rows\n")
cat(" Test set: ", nrow(test_age), "rows\n\n")
# --- Verify age distribution in split ---
cat("Training set age summary:\n")
print(summary(train_age$age))
cat("\nTest set age summary:\n")
print(summary(test_age$age))
```
## Feature Selection For Age Prediction
```{r}
#| label: age-rfe
#| cache: true
# --- Recursive Feature Elimination for Age Prediction ---
# Using Random Forest as the estimator
rfe_control_age <- rfeControl(
functions = rfFuncs,
method = "cv",
number = 5,
verbose = FALSE
)
# --- Prepare features and target ---
# Convert factors to numeric for RFE compatibility
train_age_rfe <- train_age %>%
mutate(
channel_num = as.numeric(channel),
region_num = as.numeric(region)
)
rfe_features_age <- train_age_rfe %>%
select(channel_num, items, amount, region_num)
rfe_target_age <- train_age_rfe$age
# --- Run RFE ---
set.seed(1234)
rfe_result_age <- rfe(
x = rfe_features_age,
y = rfe_target_age,
sizes = c(1, 2, 3, 4),
rfeControl = rfe_control_age
)
# --- Results ---
cat("Optimal Features Selected:\n")
print(predictors(rfe_result_age))
cat("\nRFE Performance Summary:\n")
print(rfe_result_age)
# --- RFE plot ---
# (Note: `#|` chunk options are only honoured at the top of a chunk,
# so this plot is rendered within the age-rfe chunk.)
ggplot(rfe_result_age) +
labs(
title = "Figure: RFE — Feature Selection for Age Prediction",
subtitle = "RMSE by number of features selected",
x = "Number of Features",
y = "RMSE (Cross-Validated)"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#12121a", color = NA),
panel.background = element_rect(fill = "#1e1e2e", color = "#2a2a3a"),
panel.grid.major = element_line(color = "#3a3a4a"),
panel.grid.minor = element_line(color = "#2a2a3a"),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 12),
plot.subtitle = element_text(color = "#8a8a9a", size = 10),
axis.text = element_text(color = "#e8e8e8", size = 10),
axis.title = element_text(color = "#e8e8e8", size = 11),
axis.line = element_line(color = "#8a8a9a"),
legend.background = element_rect(fill = "#1e1e2e", color = NA),
legend.text = element_text(color = "#e8e8e8"),
legend.title = element_text(color = "#e8e8e8")
)
```
## Feature Selection Results
Recursive Feature Elimination identified **three optimal features** for predicting customer age: **channel, region, and amount.**
The `items` variable was eliminated, contributing no meaningful predictive power, consistent with findings across both the EDA and the channel prediction task.
Performance improves progressively from an RMSE of 15.47 with one feature to **14.00 with three features**, with R² improving from 0.031 to 0.212. Adding the fourth feature (items) produces no improvement, confirming the three-feature model as optimal.
However, an R² of 0.212 indicates that the three selected features explain only **21.2% of the variance in customer age**, confirming the upfront acknowledgement that age is a difficult variable to predict from the available data. This is an honest and important finding that will be reflected in the model results.
All three models will be trained using channel, amount, and region as predictor variables.
## Model Training For Age Prediction
```{r}
#| label: age-models
#| cache: true
# --- Prepare train and test sets with RFE selected features ---
train_x_age <- train_age %>% select(channel, amount, region)
test_x_age <- test_age %>% select(channel, amount, region)
train_y_age <- train_age$age
test_y_age <- test_age$age
cat("Training models with features: channel, amount, region\n\n")
# --- Model 1: Linear Regression ---
set.seed(1234)
model_lr_age <- train(
x = train_x_age,
y = train_y_age,
method = "lm",
trControl = train_control,
preProcess = c("center", "scale")
)
cat("Linear Regression — Trained\n")
# --- Model 2: Random Forest ---
set.seed(1234)
model_rf_age <- train(
x = train_x_age,
y = train_y_age,
method = "rf",
trControl = train_control,
preProcess = c("center", "scale"),
tuneLength = 3
)
cat("Random Forest — Trained\n")
# --- Model 3: Decision Tree ---
set.seed(1234)
model_dt_age <- train(
x = train_x_age,
y = train_y_age,
method = "rpart",
trControl = train_control,
preProcess = c("center", "scale"),
tuneLength = 10
)
cat("Decision Tree — Trained\n")
cat("\nAll three models trained successfully.\n")
```
## Age Comparison Table And Best Model
```{r}
#| label: age-comparison
# --- Predictions on test set ---
pred_lr_age <- predict(model_lr_age, test_x_age)
pred_rf_age <- predict(model_rf_age, test_x_age)
pred_dt_age <- predict(model_dt_age, test_x_age)
# --- Model comparison table ---
model_comparison_age <- tibble(
Model = c("Linear Regression",
"Random Forest",
"Decision Tree"),
RMSE = c(
round(RMSE(pred_lr_age, test_y_age), 4),
round(RMSE(pred_rf_age, test_y_age), 4),
round(RMSE(pred_dt_age, test_y_age), 4)
),
MAE = c(
round(MAE(pred_lr_age, test_y_age), 4),
round(MAE(pred_rf_age, test_y_age), 4),
round(MAE(pred_dt_age, test_y_age), 4)
),
R_Squared = c(
round(R2(pred_lr_age, test_y_age), 4),
round(R2(pred_rf_age, test_y_age), 4),
round(R2(pred_dt_age, test_y_age), 4)
)
) %>%
arrange(RMSE)
# --- Display comparison table ---
model_comparison_age %>%
kable(
col.names = c("Model", "RMSE", "MAE", "R-Squared"),
caption = "Table: Model Comparison — Age Prediction"
) %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE, color = "#00d4ff") %>%
row_spec(1, background = "rgba(0, 212, 255, 0.08)")
# --- Best model statement ---
best_model_age <- model_comparison_age$Model[1]
best_rmse_age <- model_comparison_age$RMSE[1]
best_mae_age <- model_comparison_age$MAE[1]
best_rsquared_age <- model_comparison_age$R_Squared[1]
cat("Best Performing Model:", best_model_age, "\n")
cat("RMSE: ", best_rmse_age, "\n")
cat("MAE: ", best_mae_age, "\n")
cat("R-Squared: ", best_rsquared_age, "\n")
```
## Actual vs Predicted Table For The Decision Tree
```{r}
#| label: age-dt-evaluation
# --- Actual vs Predicted comparison ---
age_results <- tibble(
Actual = test_y_age,
Predicted = round(pred_dt_age, 1),
Error = round(abs(test_y_age - pred_dt_age), 1)
) %>%
mutate(
Error_Band = case_when(
Error <= 5 ~ "Within 5 years",
Error <= 10 ~ "Within 10 years",
Error <= 15 ~ "Within 15 years",
TRUE ~ "More than 15 years"
)
)
# --- Error band summary ---
error_summary <- age_results %>%
group_by(Error_Band) %>%
summarise(
Count = n(),
Percentage = round(Count / nrow(age_results) * 100, 2)
) %>%
arrange(desc(Percentage))
# --- Display error band table ---
error_summary %>%
kable(
col.names = c("Prediction Accuracy", "Count", "Percentage (%)"),
caption = "Table: Actual vs Predicted Age — Error Band Summary"
) %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE, color = "#00d4ff")
# --- Overall prediction summary ---
cat("\nAge Prediction Summary: Decision Tree:\n")
cat(" Mean Absolute Error:", round(mean(age_results$Error), 2), "years\n")
cat(" Median Error: ", round(median(age_results$Error), 2), "years\n")
cat(" Min Error: ", round(min(age_results$Error), 2), "years\n")
cat(" Max Error: ", round(max(age_results$Error), 2), "years\n")
```
## Save The Best Age Model
```{r}
#| label: save-age-model
# --- Save best model for age prediction ---
saveRDS(model_dt_age, "best_model_age.rds")
cat("Best model for age prediction saved successfully.\n")
cat("Model: Decision Tree\n")
cat("File: best_model_age.rds\n")
```
## Interpretation: Age Prediction
The Decision Tree achieved the lowest RMSE of **13.89 years** and an R-Squared of **0.2182**, marginally outperforming Random Forest (RMSE also 13.89 to two decimal places, R² = 0.2179) and comfortably ahead of Linear Regression (RMSE = 14.10, R² = 0.1944).
However, the error band analysis reveals the honest limitations of this model:
- Only **22.30%** of predictions fall within 5 years of the actual age
- **34.43%** of predictions are off by more than 15 years, the largest single category
- The mean absolute error is **11.84 years**, meaning on average the model misjudges a customer's age by nearly 12 years
This confirms the upfront acknowledgement that age is a challenging variable to predict from the available data. With an R-Squared of just 0.2182, channel, amount, and region together explain only **21.82% of the variance in customer age**; the remaining 78.18% is driven by factors not captured in this dataset.
**Recommendation:** This model is not reliable enough for operational use in its current form. Blackwell would need richer customer data, such as browsing behaviour, product categories purchased, or loyalty programme data, to build a meaningfully accurate age prediction model. The finding is still valuable, however: it confirms that transaction data alone is insufficient to infer customer age.
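The error-band logic used in the evaluation chunk above (a `case_when` over absolute errors) can be expressed equivalently with base R's `cut()`. The sketch below applies it to simulated absolute errors rather than the model's real residuals:

```r
# Banding absolute prediction errors with cut(); the errors here are
# simulated stand-ins, not the Decision Tree's actual residuals.
set.seed(1234)
abs_error <- abs(rnorm(1000, mean = 0, sd = 12))

bands <- cut(
  abs_error,
  breaks = c(0, 5, 10, 15, Inf),
  labels = c("Within 5 years", "Within 10 years",
             "Within 15 years", "More than 15 years"),
  include.lowest = TRUE
)

round(prop.table(table(bands)) * 100, 2)  # percentage in each band
```

`cut()` guarantees the bands are mutually exclusive and exhaustive, which avoids the ordering pitfalls of chained conditions.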
# Region Prediction
The third and final modelling task investigates whether the region a customer belongs to can be predicted from their demographic and transaction data. This is a multi-class classification problem; the target variable has four possible outcomes: North, South, East, and West.
Successfully predicting customer region would allow Blackwell to infer the geographic location of new or anonymous customers from their transaction behaviour alone, enabling region-specific marketing without requiring customers to explicitly provide their location.
**Candidate predictor variables:** age, channel, items, amount
## Data Splitting
```{r}
#| label: region-split
# --- 80/20 Train/Test Split ---
set.seed(1234)
train_index_region <- createDataPartition(df$region, p = 0.8, list = FALSE)
train_region <- df[train_index_region, ]
test_region <- df[-train_index_region, ]
cat("Region Prediction — Data Split Summary:\n")
cat(" Training set:", nrow(train_region), "rows\n")
cat(" Test set: ", nrow(test_region), "rows\n\n")
# --- Verify class balance in split ---
cat("Training set region distribution:\n")
print(prop.table(table(train_region$region)))
cat("\nTest set region distribution:\n")
print(prop.table(table(test_region$region)))
```
## Feature Selection
```{r}
#| label: region-rfe
#| cache: true
# --- Recursive Feature Elimination for Region Prediction ---
rfe_control_region <- rfeControl(
functions = rfFuncs,
method = "cv",
number = 5,
verbose = FALSE
)
# --- Prepare features and target ---
# Convert factors to numeric for RFE compatibility
train_region_rfe <- train_region %>%
mutate(channel_num = as.numeric(channel))
rfe_features_region <- train_region_rfe %>%
select(age, channel_num, items, amount)
rfe_target_region <- train_region_rfe$region
# --- Run RFE ---
set.seed(1234)
rfe_result_region <- rfe(
x = rfe_features_region,
y = rfe_target_region,
sizes = c(1, 2, 3, 4),
rfeControl = rfe_control_region
)
# --- Results ---
cat("Optimal Features Selected:\n")
print(predictors(rfe_result_region))
cat("\nRFE Performance Summary:\n")
print(rfe_result_region)
# --- RFE plot ---
ggplot(rfe_result_region) +
labs(
title = "Figure: RFE — Feature Selection for Region Prediction",
subtitle = "Accuracy by number of features selected",
x = "Number of Features",
y = "Accuracy (Cross-Validated)"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#12121a", color = NA),
panel.background = element_rect(fill = "#1e1e2e", color = "#2a2a3a"),
panel.grid.major = element_line(color = "#3a3a4a"),
panel.grid.minor = element_line(color = "#2a2a3a"),
plot.title = element_text(color = "#ffffff",
face = "bold", size = 12),
plot.subtitle = element_text(color = "#8a8a9a", size = 10),
axis.text = element_text(color = "#e8e8e8", size = 10),
axis.title = element_text(color = "#e8e8e8", size = 11),
axis.line = element_line(color = "#8a8a9a"),
legend.background = element_rect(fill = "#1e1e2e", color = NA),
legend.text = element_text(color = "#e8e8e8"),
legend.title = element_text(color = "#e8e8e8")
)
```
## Feature Selection Results
Recursive Feature Elimination identified **three optimal features** for predicting customer region: **channel, amount, and age.**
Notably, `items` has now been eliminated across all three modelling tasks, consistently contributing no meaningful predictive power throughout this entire analysis.
Accuracy improves from 0.45 with one feature to **0.6427 with three features**. Adding the fourth feature (items) actually reduces accuracy to 0.6315, confirming the three-feature model as optimal.
All three models will be trained using channel, amount, and age as predictor variables.
## Region Model Training
```{r}
#| label: region-models
#| cache: true
# --- Prepare train and test sets with RFE selected features ---
train_x_region <- train_region %>% select(age, amount, channel)
test_x_region <- test_region %>% select(age, amount, channel)
train_y_region <- train_region$region
test_y_region <- test_region$region
cat("Training models with features: age, amount, channel\n\n")
# --- Model 1: Multinomial Logistic Regression ---
set.seed(1234)
model_lr_region <- train(
x = train_x_region,
y = train_y_region,
method = "multinom",
trControl = train_control,
preProcess = c("center", "scale"),
trace = FALSE
)
cat("Multinomial Logistic Regression — Trained\n")
# --- Model 2: Random Forest ---
set.seed(1234)
model_rf_region <- train(
x = train_x_region,
y = train_y_region,
method = "rf",
trControl = train_control,
preProcess = c("center", "scale"),
tuneLength = 3
)
cat("Random Forest — Trained\n")
# --- Model 3: Decision Tree ---
set.seed(1234)
model_dt_region <- train(
x = train_x_region,
y = train_y_region,
method = "rpart",
trControl = train_control,
preProcess = c("center", "scale"),
tuneLength = 10
)
cat("Decision Tree — Trained\n")
cat("\nAll three models trained successfully.\n")
```
## Region Comparison Table And Best Model
```{r}
#| label: region-comparison
# --- Predictions on test set ---
pred_lr_region <- predict(model_lr_region, test_x_region)
pred_rf_region <- predict(model_rf_region, test_x_region)
pred_dt_region <- predict(model_dt_region, test_x_region)
# --- Confusion matrices ---
cm_lr_region <- confusionMatrix(pred_lr_region, test_y_region)
cm_rf_region <- confusionMatrix(pred_rf_region, test_y_region)
cm_dt_region <- confusionMatrix(pred_dt_region, test_y_region)
# --- Model comparison table ---
model_comparison_region <- tibble(
Model = c("Multinomial Logistic Regression",
"Random Forest",
"Decision Tree"),
Accuracy = c(cm_lr_region$overall["Accuracy"],
cm_rf_region$overall["Accuracy"],
cm_dt_region$overall["Accuracy"]),
Kappa = c(cm_lr_region$overall["Kappa"],
cm_rf_region$overall["Kappa"],
cm_dt_region$overall["Kappa"])
) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
arrange(desc(Accuracy))
# --- Display comparison table ---
model_comparison_region %>%
kable(
col.names = c("Model", "Accuracy", "Kappa"),
caption = "Table: Model Comparison — Region Prediction"
) %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE, color = "#00d4ff") %>%
row_spec(1, background = "rgba(0, 212, 255, 0.08)")
# --- Best model statement ---
best_model_region <- model_comparison_region$Model[1]
best_accuracy_region <- round(model_comparison_region$Accuracy[1] * 100, 2)
best_kappa_region <- model_comparison_region$Kappa[1]
cat("Best Performing Model:", best_model_region, "\n")
cat("Test Set Accuracy: ", best_accuracy_region, "%\n")
cat("Kappa Statistic: ", best_kappa_region, "\n")
```
## Actual vs Predicted Table: Decision Tree
```{r}
#| label: region-dt-evaluation
# --- Actual vs Predicted comparison ---
region_results <- tibble(
Actual = test_y_region,
Predicted = pred_dt_region,
Correct = test_y_region == pred_dt_region
)
# --- Actual vs Predicted summary table ---
region_summary <- region_results %>%
group_by(Actual, Predicted) %>%
summarise(Count = n(), .groups = "drop") %>%
mutate(
Result = ifelse(Actual == Predicted,
"Correct", "Incorrect")
) %>%
arrange(Actual, Predicted)
# --- Display actual vs predicted table ---
region_summary %>%
kable(
col.names = c("Actual", "Predicted",
"Count", "Result"),
caption = "Table: Actual vs Predicted — Region (Decision Tree)"
) %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE, color = "#00d4ff") %>%
column_spec(4, bold = TRUE,
color = ifelse(region_summary$Result == "Correct",
"#00ff9d", "#ff6b6b"))
# --- Overall prediction breakdown ---
cat("\nRegion Prediction Breakdown:\n")
cat(" Correctly classified: ", sum(region_results$Correct), "\n")
cat(" Incorrectly classified:", sum(!region_results$Correct), "\n")
cat(" Overall Accuracy: ",
round(mean(region_results$Correct) * 100, 2), "%\n")
# --- Per region accuracy ---
cat("\nPer Region Accuracy:\n")
region_results %>%
group_by(Actual) %>%
summarise(
Total = n(),
Correct = sum(Correct),
Accuracy = round(mean(Correct) * 100, 2)
) %>%
print()
```
## Save The Best Region Model
```{r}
#| label: save-region-model
# --- Save best model for region prediction ---
saveRDS(model_dt_region, "best_model_region.rds")
cat("Best model for region prediction saved successfully.\n")
cat("Model: Decision Tree\n")
cat("File: best_model_region.rds\n")
```
## Interpretation: Region Prediction
The Decision Tree outperformed both Random Forest and Multinomial Logistic Regression, achieving a test set accuracy of **64.41%** and a Kappa of **0.5175**, indicating moderate agreement beyond chance for a four-class prediction problem.
The per-region accuracy reveals a striking performance imbalance:
- **South (99.92% accuracy):** virtually perfect. The South's distinct customer profile, older age and a spending ceiling of \$499.94, makes it uniquely identifiable from the other regions.
- **West (73.96% accuracy):** good performance, driven by the West's young, high-spending customer profile.
- **North (54.36% accuracy):** moderate. North customers overlap demographically with both East and West, making them harder to distinguish.
- **East (20.11% accuracy):** poor. The majority of East customers are misclassified as North (1,426 cases), suggesting the two regions have demographic and spending profiles too similar for the model to separate reliably.
Overall, 10,302 out of 15,994 customers were correctly classified, with 5,692 misclassifications, concentrated almost entirely in the East region.
**Recommendation:** The model is reliable for identifying South and West customers, but struggles significantly with East region customers. Blackwell should treat East region predictions with caution. Richer data such as product category preferences or browsing behaviour would likely improve East vs North discrimination considerably.
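The per-region accuracies quoted above are per-class recall values: diagonal counts divided by row totals of the actual-vs-predicted table. A self-contained sketch with invented counts (not the report's confusion matrix) shows the computation:

```r
# Per-class recall = diagonal / row totals of an Actual-by-Predicted table.
# The counts below are invented for illustration only.
cm <- matrix(c(210,   5,  60,  25,
                 2, 395,   2,   1,
               140,   3,  40,  17,
                30,   1,  14, 155),
             nrow = 4, byrow = TRUE,
             dimnames = list(Actual    = c("North", "South", "East", "West"),
                             Predicted = c("North", "South", "East", "West")))

per_class_recall <- diag(cm) / rowSums(cm)
round(per_class_recall * 100, 2)
# North 70.00, South 98.75, East 20.00, West 77.50 for these toy counts
```

Note how the toy East row mirrors the real failure mode: most of its mass sits in the North column, so East recall collapses even though overall accuracy looks acceptable.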
# Conclusions & Recommendations
This section brings together the key findings from both the Exploratory Data Analysis and the Predictive Modelling phases into a concise, actionable summary for Danielle Sherman and Martin Goodrich. Every conclusion drawn here is grounded directly in the data: no assumptions, no generalisations.
```{r}
#| label: conclusions-summary
# --- Final findings summary table ---
conclusions <- tibble(
Business_Question = c(
"Do customers in different regions spend more per transaction?",
"Is there a relationship between items purchased and amount spent?",
"Do older customers spend more than younger customers?",
"Are in-store customers older than online customers?",
"Are there age differences across regions?",
"Can purchase channel be predicted?",
"Can customer age be predicted?",
"Can customer region be predicted?"
),
Finding = c(
"Yes, significant regional differences exist. West leads ($1,283.94), South trails ($252.10) with a spending ceiling of $499.94.",
"No, Pearson r = 0.0004. No meaningful relationship exists between items purchased and amount spent.",
"No, the data contradicts this hypothesis. Younger customers spend more. Pearson r = -0.282.",
"No, the data contradicts this hypothesis. Online customers are older (mean 48.56) than in-store (mean 42.96).",
"Yes, statistically significant age differences exist across regions (F = 6,139.42, p < 0.0001). South is oldest (56.6), West is youngest (38.8).",
"Yes. Decision Tree achieves 88.82% accuracy. Online customers identified with near-perfect precision.",
"Partially. Decision Tree RMSE = 13.89 years, R² = 0.22. Model explains only 21.82% of age variance. Not reliable for operational use.",
"Partially. Decision Tree achieves 64.41% accuracy. South (99.92%) and West (73.96%) predicted well. East poorly predicted (20.11%)."
),
Recommendation = c(
"Allocate higher marketing budgets to West and East. Deploy value-oriented strategy for South. Investigate absence of high-value purchases in South.",
"Bundling and volume-based pricing strategies are unlikely to drive higher revenue. Focus on product type and price point instead.",
"Marketing campaigns targeting older customers as higher spenders are not justified. Focus high-value campaigns on younger demographics (18-39).",
"The website already attracts older customers. Focus should shift to understanding why younger in-store customers generate higher transaction values.",
"Regional marketing should account for demographic differences. South needs trust-building, value-oriented messaging. West and East suit premium campaigns.",
"Deploy the Decision Tree model to proactively segment customers by channel for targeted marketing communications.",
"Transaction data alone is insufficient to predict age reliably. Richer data, browsing behaviour, product categories are needed.",
"Model is reliable for South and West segmentation. East region predictions should be treated with caution until richer data is available."
)
)
# --- Display conclusions table ---
conclusions %>%
kable(
col.names = c("Business Question",
"Finding",
"Recommendation"),
caption = "Table: Summary of Findings & Recommendations"
) %>%
kable_styling(full_width = TRUE) %>%
column_spec(1, bold = TRUE, color = "#00d4ff", width = "25%") %>%
column_spec(2, width = "40%") %>%
column_spec(3, width = "35%")
```
## Closing Remarks
This analysis has delivered data-driven answers to every business question posed by Danielle Sherman and tested both hypotheses raised by Martin Goodrich. The findings are clear:
**Martin's hypotheses are not supported by the data.** Older customers do not spend more, and in-store customers are not older than online customers. Any business decisions based on these assumptions would be misdirected.
**Regional differences are the most actionable insight.** The West and East regions represent Blackwell's highest-value customer segments, while the South presents a fundamentally different and more constrained spending profile that warrants its own dedicated strategy.
**The Decision Tree model is the strongest performer** across all three prediction tasks, achieving 88.82% accuracy for channel prediction and providing reliable region segmentation for South and West customers.
**The most important limitation** identified in this analysis is data richness. Transaction data alone (five variables across 80,000 records) has clear predictive limits, particularly for age prediction. Blackwell's next step should be enriching its customer data collection to include browsing behaviour, product category preferences, and loyalty programme engagement. This would unlock significantly more powerful predictive models and deeper customer insights.
***Analysis conducted by Freda Erinmwingbovo; February 2026***